
fix: remove deprecated code_mapping, dev, refresh_cache from examples and README #935

Merged
jhnwu3 merged 1 commit into sunlabuiuc:master from haoyu-haoyu:fix/deprecated-code-mapping
Apr 8, 2026

Conversation

@haoyu-haoyu
Contributor

Summary

Remove references to deprecated code_mapping, dev, and refresh_cache parameters from example scripts and README. These parameters belonged to the legacy BaseEHRDataset API and are no longer accepted by the v2.0 MIMIC3Dataset/MIMIC4Dataset (based on BaseDataset).
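For call sites that still build their arguments from older configuration, one possible migration step is to strip the three legacy keys before constructing the dataset. The sketch below is illustrative only: `drop_legacy_kwargs`, `LEGACY_PARAMS`, and the example dictionary are hypothetical and not part of PyHealth; only the three parameter names come from this PR.

```python
# Hypothetical helper, not part of PyHealth: strips the legacy parameters
# named in this PR from a dict of keyword arguments.
LEGACY_PARAMS = ("code_mapping", "dev", "refresh_cache")


def drop_legacy_kwargs(kwargs):
    """Return (cleaned kwargs, sorted list of removed legacy keys)."""
    cleaned = {k: v for k, v in kwargs.items() if k not in LEGACY_PARAMS}
    removed = sorted(k for k in kwargs if k in LEGACY_PARAMS)
    return cleaned, removed


# Example arguments in the legacy style; "root" and "tables" are
# placeholder keys chosen for illustration.
legacy_kwargs = {
    "root": "/path/to/mimic3",
    "tables": ["DIAGNOSES_ICD", "PROCEDURES_ICD"],
    "code_mapping": {"NDC": "CCSCM"},
    "dev": True,
    "refresh_cache": False,
}

cleaned, removed = drop_legacy_kwargs(legacy_kwargs)
print(removed)
```

Returning the removed keys alongside the cleaned dict makes it easy to log a warning for each script that still passes them.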

Files updated

| File | Change |
|---|---|
| README.rst | Remove code_mapping={"NDC": "CCSCM"} from quickstart example |
| examples/mortality_prediction/mortality_mimic3_grasp.py | Remove code_mapping, dev, refresh_cache |
| examples/drug_recommendation/drug_recommendation_mimic4_gamenet.py | Remove code_mapping, dev, refresh_cache |
| examples/patient_linkage_mimic3_medlink.py | Remove code_mapping, dev, refresh_cache |
| leaderboard/utils.py | Remove code_mapping, dev, refresh_cache from MIMIC3/MIMIC4 loaders |

Out of scope (for follow-up PRs)

  • Task file docstrings (pyhealth/tasks/*.py) — these contain code_mapping in >>> doctest examples that also need updating
  • pyhealth/datasets/mimicextract.py — still uses the legacy API (not yet migrated to v2.0)
  • chat-assistant/corpus/ — auto-generated text corpus, will update when source docs are fixed

Fixes #535

The v2.0 MIMIC3Dataset/MIMIC4Dataset (based on BaseDataset) no longer
accepts code_mapping, dev, or refresh_cache parameters. These were
part of the legacy BaseEHRDataset API.

Update README.rst, example scripts, and leaderboard utilities to use
the current v2.0 API.

Note: task file docstrings and pyhealth/datasets/mimicextract.py
still reference code_mapping but are left for separate PRs since
mimicextract.py has not yet been migrated to v2.0.

Fixes sunlabuiuc#535
Collaborator

@jhnwu3 jhnwu3 left a comment

lgtm

@jhnwu3 jhnwu3 merged commit d7641e0 into sunlabuiuc:master Apr 8, 2026
1 check passed
rmumme2 added a commit to deadlywrong/PyHealth that referenced this pull request Apr 8, 2026
* Update/core docs (sunlabuiuc#889)

* add new docs

* index

* overview page added

* clean up and fix old details

* [Conformal EEG] TUEV/TUAB Compatibility (sunlabuiuc#894)

* Fixed repo to be able to run TUEV/TUAB + updated example scripts

* Args need to be passed correctly

* Minor fixes and precomputed STFT logic

* Fix the test files to reflect codebase changes

* Args update

* Updated Conformal Test Scripts (sunlabuiuc#895)

* Fixed repo to be able to run TUEV/TUAB + updated example scripts

* Args need to be passed correctly

* Minor fixes and precomputed STFT logic

* Fix the test files to reflect codebase changes

* Args update

* test script fixes

* fix: prevent batch_size=1 crashes, add weights_only to torch.load, fix device/contiguity issues (sunlabuiuc#901)

1. Fix bare .squeeze() calls that silently remove the batch dimension
   when batch_size=1, causing wrong results during single-sample inference:
   - concare.py: .squeeze() → .squeeze(dim=-1) and .squeeze(dim=1)
   - agent.py: .squeeze() → .squeeze(dim=-1) or removed (already 1-D after .sum/.mean)

2. Add weights_only=True to all torch.load() calls for PyTorch 2.6+
   compatibility and security (prevents arbitrary code execution via
   pickle deserialization):
   - trainer.py, biot.py, tfm_tokenizer.py (2 calls), kg_base.py

3. Add .contiguous() before pack_padded_sequence in RNNLayer to prevent
   cuDNN errors with non-contiguous input tensors (fixes sunlabuiuc#800)

4. Fix StageNet device mismatch — tensors were created on CPU instead of
   the input tensor's device, causing crashes during GPU training:
   - torch.zeros/ones(...) → torch.zeros/ones(..., device=device)
   - time == None → time is None (PEP8)
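The batch_size=1 problem in item 1 can be reproduced on shapes alone. The sketch below uses NumPy as a stand-in (an assumption made so the example stays lightweight); `torch.Tensor.squeeze()` without a `dim` argument likewise removes every size-1 dimension. The array shape is made up for illustration and is not taken from concare.py or agent.py.

```python
import numpy as np

# A batch holding a single sample: (batch=1, seq_len=5, features=1).
x = np.zeros((1, 5, 1))

# A bare squeeze drops every size-1 axis, including the batch axis.
bare = x.squeeze()

# Squeezing only the last axis keeps the batch dimension intact.
targeted = x.squeeze(axis=-1)

print(bare.shape, targeted.shape)
```

With a batch of two or more samples both calls return the same shape, which is why the bug only shows up during single-sample inference.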

* fix: improve research reliability — metrics mutation, eval placement, reproducible splits (sunlabuiuc#902)

Three fixes that directly affect the trustworthiness of research results:

1. regression.py: kl_divergence computation mutated the input arrays
   (x, x_rec) in-place via clamping and normalization. When multiple
   metrics were requested (e.g., ["kl_divergence", "mse", "mae"]),
   mse/mae were computed on the modified arrays, producing incorrect
   values. Fixed by operating on copies.

2. trainer.py: model.eval() was called inside the per-batch loop in
   inference(), redundantly setting eval mode on every batch. Moved
   to before the loop — called once as intended.

3. splitter.py: all split functions used np.random.seed() which mutates
   the global numpy random state. This causes cross-contamination when
   multiple splits are called sequentially, making experiments
   non-reproducible. Replaced all 7 occurrences with
   np.random.default_rng(seed) which creates an isolated RNG instance.
   The existing sample_balanced() already used default_rng correctly.
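The difference described in item 3 can be checked directly. In this minimal sketch, `split_indices` is a hypothetical helper (not a function from splitter.py) that shuffles indices with its own generator; the surrounding lines verify that calling it leaves the global NumPy random stream untouched.

```python
import numpy as np


def split_indices(n, seed):
    """Shuffle indices 0..n-1 with an isolated generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # local RNG, independent of np.random.*
    return rng.permutation(n)


# Record what the global stream produces right after seeding.
np.random.seed(0)
expected_next = np.random.rand()

# Seed again, run a split, then draw from the global stream.
np.random.seed(0)
first = split_indices(10, seed=42)
actual_next = np.random.rand()

# The same seed reproduces the same split regardless of global state.
second = split_indices(10, seed=42)

print(expected_next == actual_next, (first == second).all())
```

Had `split_indices` called `np.random.seed(seed)` instead, the global stream would have been reset by the split and `actual_next` would no longer match `expected_next`.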

* fix: port GRASP model to PyHealth 2.0 API (fixes sunlabuiuc#891) (sunlabuiuc#903)

The GRASP model was completely non-functional in PyHealth 2.0 because it
still used the legacy 1.x BaseModel constructor and removed helper
methods (get_label_tokenizer, add_feature_transform_layer,
prepare_labels, padding2d/3d).

Changes:
- Rewrite GRASP.__init__ to use the 2.0 pattern (matching ConCare):
  - super().__init__(dataset=dataset) instead of passing feature_keys/label_key/mode
  - EmbeddingModel(dataset, embedding_dim) replaces manual type dispatch
  - self.get_output_size() without arguments
  - Auto-derive feature_keys, label_key, mode from dataset schemas
- Rewrite GRASP.forward to use EmbeddingModel:
  - embedded, masks = self.embedding_model(kwargs, output_mask=True)
  - Labels from kwargs[self.label_key].to(self.device)
  - Eliminates ~60 lines of manual tokenization/padding/embedding
- Remove eliminated parameters: feature_keys, label_key, mode, use_embedding
- Update imports: SampleEHRDataset → SampleDataset, add EmbeddingModel
- Update docstring examples to 2.0 API
- Update __main__ block to use create_sample_dataset
- Add tests/core/test_grasp.py with 8 test cases covering:
  initialization, forward/backward, embed extraction, GRU/LSTM backbones

GRASPLayer (the algorithm core) is unchanged.

* making the PyHealth Research Initiative page way less confusing and dense (sunlabuiuc#907)

just doc things

* add a reference at the top of the PyHealth page to our new project page, so users who join can find an easier-to-navigate, less documentation-heavy page with what they're looking for (sunlabuiuc#910)

* [Conformal EEG] Conformal Testing Fixes (sunlabuiuc#909)

* Fixed repo to be able to run TUEV/TUAB + updated example scripts

* Args need to be passed correctly

* Minor fixes and precomputed STFT logic

* Fix the test files to reflect codebase changes

* Args update

* test script fixes

* dataset path update

* fix contrawr - small change

* divide by 0 error

* Incorporate tfm logic

* Fix label stuff

* tuab fixes

* fix metrics

* aggregate alphas

* Fix splitting and add tfm weights

* fix tfm+tuab

* updates scripts and haoyu splitter

* fix conflict

* Remove weightfiles from tracking and add to .gitignore

Weight files are large binaries distributed separately; untrack all
existing .pth files under weightfiles/ and add weightfiles/ to
.gitignore so they are excluded from future commits and the PR.

Made-with: Cursor

* feat: add optional dependency groups for graph and NLP extras (sunlabuiuc#904)

* feat: add optional dependency groups for graph and NLP extras (sunlabuiuc#890)

Add [project.optional-dependencies] to pyproject.toml so users can
install domain-specific dependencies via pip extras:

  pip install pyhealth[graph]   # torch-geometric for GraphCare, KG
  pip install pyhealth[nlp]     # editdistance, rouge_score, nltk

The codebase already uses try/except ImportError with HAS_PYG flags
for torch-geometric, and the NLP metrics define their required
versions in each scorer class. This change exposes those dependencies
through standard Python packaging so pip can resolve them.
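The guard pattern mentioned above usually looks something like the following. This is a generic sketch rather than the code in PyHealth: the `HAS_PYG` name comes from the commit message, while `require_pyg` and its error text are hypothetical.

```python
# Optional-dependency guard: record whether the extra is installed instead of
# failing at import time.
try:
    import torch_geometric  # noqa: F401  (imported only to probe availability)
    HAS_PYG = True
except ImportError:
    HAS_PYG = False


def require_pyg():
    """Raise a helpful error when the optional graph extra is missing."""
    if not HAS_PYG:
        raise ImportError(
            "torch-geometric is required for this feature; "
            "install it with: pip install pyhealth[graph]"
        )


print("torch-geometric available:", HAS_PYG)
```

Pointing the error message at the extras name gives users a single install command instead of a bare missing-module traceback.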

Version pins match the requirements declared in the code:
- editdistance~=0.8.1 (pyhealth/nlp/metrics.py:356)
- rouge_score~=0.1.2 (pyhealth/nlp/metrics.py:415)
- nltk~=3.9.1 (pyhealth/nlp/metrics.py:397)
- torch-geometric>=2.6.0 (compatible with PyTorch 2.7)

Closes sunlabuiuc#890

* fix: move optional-dependencies after scalar fields to fix TOML structure

Move [project.optional-dependencies] from between dependencies and
license (line 49) to after keywords (line 62), before [project.urls].

In TOML, a sub-table header like [project.optional-dependencies]
closes the parent [project] table, so placing it before license and
keywords caused those fields to be excluded from [project]. This
broke CI validation.

Verified with tomllib that all project fields (name, license,
keywords, optional-dependencies, urls) parse correctly under
[project].

* Add/mm retain adacare (sunlabuiuc#885)

* init commit

* RNN memory fix

* add example scripts here

* more bug fixes?

* commit to see new changes

* add test cases

* fix basemodel leakage of args

* fixes to tests and examples

* more examples

* reduce unnecessary checks, enable crashing when a cache is invalid

* fix nested sequence rnn problems

* fixes for the concare and transformer model exploding in memory

* fix concare merge conflict again

* fix for 3D channel for CNN

* update and delete defunct docs

* better loc comparisons and also a bunch of model fixes hopefully

* test case updates to match our bug fixes

* fix instability in calibration tests for CP


TL;DR: Fixes a variety of dataset-loading and runtime bugs, fixes the splits for TUEV/TUAB, and adds a good number of performance fixes for Transformer and ConCare. We can always iterate on these fixes later.

* concare fix (sunlabuiuc#920)

Bypassing a PR review, because of speed/reviewer bottleneck reasons.

* fix pixi warning and version format for backend (sunlabuiuc#917)

* fix: remove deprecated code_mapping, dev, refresh_cache from examples (sunlabuiuc#935)

The v2.0 MIMIC3Dataset/MIMIC4Dataset (based on BaseDataset) no longer
accepts code_mapping, dev, or refresh_cache parameters. These were
part of the legacy BaseEHRDataset API.

Update README.rst, example scripts, and leaderboard utilities to use
the current v2.0 API.

Note: task file docstrings and pyhealth/datasets/mimicextract.py
still reference code_mapping but are left for separate PRs since
mimicextract.py has not yet been migrated to v2.0.

Fixes sunlabuiuc#535

---------

Co-authored-by: John Wu <54558896+jhnwu3@users.noreply.github.com>
Co-authored-by: Arjun Chatterjee <arj0jeechat@gmail.com>
Co-authored-by: haoyu-haoyu <85037553+haoyu-haoyu@users.noreply.github.com>
Co-authored-by: Paul Landes <landes@mailc.net>

Development

Successfully merging this pull request may close these issues.

Example code still uses deprecated argument "code_mapping"

2 participants